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Abstract 



Approximate Bayes Computations (ABC) are used for parameter inference when the 
likelihood function is expensive to evaluate but relatively cheap to sample from. In ABC, 
a population of particles in the product space of outputs and parameters is propagated 
in such a way that its output marginal approaches a delta function at the measured 
output and its parameter marginal approaches the posterior distribution. Inspired by 
simulated annealing, we present a new class of particle algorithms for ABC, based on a 
sequence of Metropolis kernels, associated with a decreasing sequence of tolerances w.r.t. 
the measured output. Unlike other algorithms, our class of algorithms is not based on 
importance sampling. Hence, it does not suffer from a loss of effective sample size due to 
re-sampling. We prove convergence under a condition on the speed at which the tolerance 
is decreased. Furthermore, we present a scheme that adapts the tolerance according to 
the mean and the standard deviation of the distance of the particles from the measured 
output, and the jump distribution in parameter space according to the covariance of the 
population. These adaptations can be interpreted as mean-field interactions between the 
particles. Thus, the statistical independence of the particles is preserved, in the limit of 
infinite sample size. The performance of this new class of algorithms is investigated with 
a toy example, for which we have an analytical solution. 

1 Introduction 

One way of implementing parameter inference in the Bayesian framework is to generate sam- 
ples from the posterior distribution 



where f(0) denotes the prior distribution encoding our knowledge about the parameter vector 
before the experiment and f(y\0) is the likelihood function, that is, the probability density 
of outputs given the parameter vector 6, evaluated at the measurement vector y. Numeri- 
cal methods such as the Metropolis algorithm [8J require many evaluations of the likelihood 
function to generate such a sample. However, for complex stochastic models, the likelihood 
function is often prohibitively expensive to evaluate. Therefore, in recent years, algorithms 

*Eawag, aquatic research, 8600 Diibendorf, Switzerland. 

' Seminar fur Statistik, ETH Zurich, 8092 Zurich, Switzerland. 





have been suggested that generate samples from ([T]) by sampling from the likelihood rather 
than calculating its value. 

As far as we know, the origin of these algorithms is to be found in population genetics. 
Tavare et al. [10] replaced the output of a genetic model by a summary statistic and adopted 
a rejection technique to generate samples from the posterior. Weiss et al. [12] extended this 
method sampling a vector of summary statistics and introducing a tolerance for its distance 
from the observed summary statistics. Thus, their algorithm generates samples from an 
approximate posterior. Algorithms that generate samples from an approximate posterior via 
sampling outputs from the likelihood are nowadays called Approximate Bayes Computations 
(ABC). Marjoram et al. [2] used Markov chains to produce samples from an approximate 
posterior. Their algorithm combines a random walk in parameter space with drawing from the 
likelihood and an acceptance/rejection step that accounts for the prior and only accepts moves 
into an e ball around the target y. However, a small static tolerance leads to a high rejection 
rate. Therefore, Toni et al. [H] suggested using a decreasing sequence of tolerances and letting 
a population of particles of constant size N evolve towards an approximate posterior. Their 
algorithm consists of an iteration of importance sampling steps, where each iteration consists 
of drawing a new population from the old one with weights and subsequent re- weighting. This 
re-weighting leads to a loss of effective sample size at each iteration step and, furthermore, 
computational costs of the order 0(N 2 ). An adaptive version of Toni's algorithm, which uses 
the empirical variances of the population to adapt the jump distribution in parameter space, 
was presented by Beaumont [3]. All of the mentioned algorithms generate samples from the 
probability distribution proportional to /(0)/(x|0)x(e — p(x, y)), where p is some metric on 
the output space and \ denotes the Heaviside function whose value is unity if its argument 
is non-negative and otherwise. 

In this paper, we present a new class of (adaptive) population algorithms that are of 
order O(N) and do not suffer from a loss of effective sample size. The idea is to start with a 
population of particles drawn from an arbitrary distribution (e.g. the prior) in the product 
space of parameters and outputs and apply a sequence of Markov kernels, (Pe k ), each of which 
having 

Z- 1 ( efc )/(0)/(x|^) e -^)^ 

as equilibrium distribution. A choice for P t , which does not require evaluation of f(x\0) for 
simulation is given in The key question is then how fast we should decrease e k in order 
to have a fast convergence and at the same time not to acquire an additional bias due to a 
too fast convergence. We will give a convergence proof for a schedule that satisfies 

e k > const k~ a ' n , 

where n is the dimension of the output space and a is defined in Q. Furthermore, we will 
present an adaptive schedule that attempts to stay close to equilibrium, at all times. Both 
the jump distribution in parameter space and the tolerance e are adapted using the empirical 
covariance of the population in parameter space and both the average and standard deviation 
of the distance from the target, respectively. The adaptation of e we suggest was developed 
for simulated annealing and motivated from thermodynamics [9j. It is a heuristic scheme, 
for which there is no convergence proof, yet. The adaptation can be interpreted as a mean- 
field interaction between the particles. As a consequence, the particles remain statistically 
independent, in the limit of an infinite sample size. 
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The tolerance e that can be achieved in reasonable time is limited by the dimension of the 
output space. This deficiency is inherent to all ABC algorithms simply because drawing an 
output from an e-ball around y scales like e n . Methods to reduce this bias are investigated 
elsewhere (see, e.g., Leuenberger et al. [6]). 



The paper is organized as follows: In Subsect. 2.1, we explain the main idea behind our 



class of algorithms. In Subsect. 2.2 the explicit scheme together with a convergence proof is 



given. The adaptive scheme is developed in Subsect. 2.3, at the end of which a version of it 
is provided in pseudo-code for convenience. In Sect. 2.4, a comparison with the Metropolis 
algorithm and population algorithms that are based on importance sampling is made. Sect. 
[3] contains an application to a toy model, for which the posterior is available analytically. 

2 A new class of ABC algorithms 

2.1 Basic idea 

Our aim is to sample from the posterior distribution Q, without evaluating the likelihood 
function. The basic idea of ABC is to rewrite ([!]) as the marginalization 

fpost(e\y) oc j /(x|0)/(0)5(x - y)dx (2) 

and sample from the joint density f(x\O)f(0)5(x — y) in the (0, x)-space, O x X. If the output 
space has a high cardinality or is continuous, sampling from f(x\0)f(0)5(x — y) becomes 
inefficient or impossible, respectively. In these cases, we approximate it by the following 
family of distributions 

7Te(0,x) = ^/(x|0)/(0)e-^/ £ , (3) 

where p(x, y) measures how close x is to the observation y. For simplicity, we set X = W l 
and 

1 - 

p(x,y) = - Vlxi-^r, (4) 
a 

i=l 

for some a > 0, but our results could easily be extended to more general manifolds equipped 
with distance measures obeying suitable regularity conditions. This might become necessary, 
if summary statistics are used to map the output space to some smaller-dimensional manifold 
(see, e.g., [TO], [H]). 

Under the assumption that f(x\0) is uniformly bounded and, as a function of x, continu- 
ous at y, 7r e converges weakly to f(x\0)f(0)5(x — y)d0dx, for e \ 0. Our idea is to choose a 
family of Markov transition kernels (P e ) on the space x X, which have n e as stationary dis- 
tribution and apply them recursively on members of a sample drawn from an arbitrary initial 
distribution, for a decreasing sequence of e's. If e is decreased sufficiently slowly, we expect 
to end up with an approximate sample from the posterior distribution. This is analogous to 
the simulated annealing algorithm, although in simulated annealing the limiting distribution 
is usually concentrated on a finite set. Still, we will strongly rely on ideas developed in the 
context of simulated annealing. The transition kernels (P e ) that we will use are defined by 
the transition densities 

q e ((0',x'),(0,x)) = k(0',0)f(x\0)mm 1 1, J f ^/ )e _ p{x , iy)/e j , (5) 
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combined with a multiple of a Dirac delta distribution at (0', x') such that P e ((d' , x'), G x X) = 
1. Here, k is a symmetric transition density on G. It is straightforward to check that 7r e is 
the equilibrium distribution for P e . 

The main question now is how fast e should be decreased. Obviously, an arbitrarily slow 
decrease of e allows to stay arbitrarily close to equilibrium at all times, which guarantees 
convergence. However, this is clearly inefficient. On the other hand, a decrease that is much 
faster than the relaxation velocity of the transition kernel may result in slow convergence 
(because the acceptance probability decreases for decreasing e) or convergence to a biased 
result. A bias can be the result of the process not having enough time to explore the G 
space while converging in the X space. For instance if we set e = 0, (0, x) is accepted iff 
p(x,y) < p(x',y). Hence in this case, the prior has no influence, which leads to convergence 
to a biased result. 

In the next subsection we will present an explicit schedule (e/%) that ensures convergence 
to an unbiased result. A potentially better performance can be achieved when the state of 
the system is used to adapt the tolerance e and the jump distribution k. This idea will be 



developed in Subsect. 2.3 



2.2 An explicit scheme with convergence proof 

In this subsection, we use a time discrete description. That is, we start with a sample from 
an arbitrary distribution fio and then recursively make transitions of the whole sample with 
the kernel P 6k , for an explicitly given decreasing sequence e& \ 0. In this way, we generate 
samples distributed according to 

/ifc+i = p k Pe k+1 = J Pe k+1 (0,x;.)dfi k {0,-x). (6) 

We expect that for a suitable choice of (ek), will converge weakly to f(x\0)f(0)8(x— y)d0dx, 
and thus in particular the marginal will converge weakly to the posterior distribution Q. 
In order to ease the notation we set z = (0 T ,x T ) T and write, for the joint prior, 

f(z) :=f(x\0)f(0). 

Furthermore, w.l.o.g. we will assume y = and replace p(x,y) by p(x). For our main result, 
we make the following assumptions about the parameter space G and the functions k(0',0), 
f(9) and /(x|0) thereon: 

(Al) 3ci > 1 such that q 1 < f(0)/f(0') < a, for all 0,0' G Q. 
(A2) 3c 2 > such that k(0',0) > c 2 f(0), for all 0,0' £ G. 

(A3) f(x\0) is continuously differentiable w.r.t. x for all 0, and the function and all partial 
derivatives are bounded uniformly in x and 0. 

These conditions essentially restrict the parameter space to be compact. We will in fact 
prove stronger than weak-convergence results, namely convergence in total variation of the 
distributions of (0, e k ^"x), with a as defined in (4). The densities of these scaled distributions 
are 

//) \ n/a //, 1/a x 

/i, fc (0,x) := e k ' Hk\0,% x) 
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and 



where 



C{e 1/a ) = J /(e 1/Q x|0)/(0)exp(-p(x))dz, 

and the transition densities for the scaled variables are 

q el/a (z, z') = e n / a q e ((0, e^x), (6', e^x')) . 

Theorem 2.1. If the assumptions (Al) - (A3) above are satisfied and if 

e k > const k~ a ' n , (7) 

for an arbitrary constant (where n denotes the dimension of X and a is defined by ^ty), 
then, for any absolutely continuous initial distribution p,Q the distribution jX k converges in 
total variation to ttq{z) oc f pO st(0\0) exp(— p(x)) ; for k — > oo. 

Proof: We will apply corollary (2.34) in [5j. We start by introducing some notation. Let 

7Tfc = TT ek , P k = P f _ k , P s -t = PsP.s+1 ■ ■ - Pt, 

where P t is defined by the transition density q e . 

By assumption (A3) and dominated convergence, 

. f » u . (() . /(O|0)/(0)ex P (-p(x)) 
^k(y,x) -> 7r O (0,Xj - 



J/(O|0)/(0)d9/exp(-p(x))cfe 

pointwise and thus by Scheffe's theorem also in L 1 -norm, that is in total variation. In order 
to deduce 

IIAo-fb:* - ^"o| \ tv — > 0, 
we have to verify conditions (2.31) and (2.33) in |5j. These conditions are 

k 

where 

c(P k ) = sup||P fc (z, .) - P k (z, .)\\ TV , 

z,z' 

and 

y^||7T fc +i -TTfcllrv < oo- (9) 

k 

Replacing e l / a by e, we may set, without loss of generality, a = 1. To get an upper bound 
for c(P e ) we use 

c(P e ) = sup ( 1 - / mm(q e (z ,z),q e (z" ,z))dz ) . 

z',z" \ J J 

By (Al) and (A2), for any z', 

g e (z',z) > e ft ^/(*)/(ex|0)exp(-p(x)) . 
ci 
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Hence we obtain 



/ 



mm(q e (z',z),q £ (z",z))dz > e n —C(e). 

ci 



Because C(e) — > C(0) > as e — > 0, it follows that, for e sufficiently small e, 

c(P«)<l-^a», (10) 

and ([8]) holds for the choice Q. 

In order to show Q, we start with 

By (A3) and the intermediate value theorem, we obtain that 

|/(ex|0) - /(e'x|0)| < const ||x||i|e - e'| 
and, moreover, that C(e) is differentiable with 

|C'(e)| < const J ||x||i exp(— p(x.))dx , 

where const is the bound for the partial derivatives of f{-\6). Hence we find that 

||7Te - ttvIItv < J ||x||iexp(-p(x))dx|e-e / | . 

Therefore ([9]) holds for any sequence (e^) which converges monotonically to zero. 

□ 

2.3 An adaptive scheme 

In this subsection, we adopt an optimal adaptive cooling strategy that has been developed for 
simulated annealing [9]. It is naturally expressed in the language of thermodynamics using 
a continuous time description. We propagate a distribution, /u(z,i), with ^ , which is now 
interpreted as a transition rate and whose tolerance e is time-dependent. Then, the time 
dependence of /i(z,i) is described by the master equation (or Kolmogorov forward equation) 



d 
dt 



MOM) = j (M z V)<7 e (t)( z/ , z ) -^(z,09e(i)( z , z ')) dz/ - (n) 



Now, the idea is to decrease e(t) adaptively in such a way that /i(z,i) stays close to the target 
distribution 

tt(z, t) = Z-\e{t))f{-x\0)f{e)e-^l^ , (12) 
at all times, while minimizing the number of computer iterations needed to reach a given 



final state. Thus, e(t) will be dependent on fi(t) and the master equation (11) will become 
non-linear. As a measure for the number of computer iterations, the total entropy production 
is used. We introduce the notation p(tt) for the mean distance to y under distribution tt, i.e., 
we set 



p(ir(t)) = J p(x,y)vr(z,i)dz, 
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and, analogously, for p(p,(t)). It can be shown that (see, e.g., (2]), as long as the difference 
p{p[t)) — p(ir(t)) is relatively small, minimal entropy production is approximately satisfied if 
e(t) satisfies the differential equation 



deify 
dt 



ve(t) 



(13) 



where v is constant and relatively small. In ( |13| ), C(t) is the derivative of the mean particle 
distance from y w.r.t. e, i.e., 

dp(ir(t)) 



C{t) 



de(t) 



(14) 



If p is interpreted as an energy and e as a temperature then C(t) can be interpreted as a heat 
capacity. Furthermore, n{t) is a relaxation time, namely the time the system's mean energy 
would take to reach the target's mean energy at current velocity. That is, 



T}(t) 



p(n(t)) - p{p{t)) 
dp{iv{t))/dt 



(15) 



In terms of natural time units defined by n(t) and rescaling energy with <r(7r(i)) (see (17)), 



the tuning parameter v can be interpreted as a thermodynamic speed. In accordance with [9j, 
we shall refer to the algorithm presented here as the constant thermodynamic speed ( CTS) 



algorithm. Using eqn. ( 12 ) 



C{t) 



where 



and using 



a 2 (n(t)) 



e 2 (t) ' 
(p(x,y) -p(vr(t))) 2 7r(z,t)dz, 



dt 



Cit) 



dejt) 
dt 



eqn. (13) can be expressed as 



p(n(t)) = p(p(t)) - va(n(t)) 



(16) 



(17) 



(18) 



(19) 



that is, the equilibrium mean energy is kept a constant multiple of the equilibrium standard 
deviation below the current system's mean energy. Assuming that a(p(t)) doesn't vary a lot 



we approximate a(ir(t)) by a(p(t)). Using (19) we can then determine the time dependence 



of p(ir(t)) from p,(t). The time dependence of e(t) is then derived from (16) and (18): 



de(t) e 2 (t) dp(7r(t)) e 2 (t) dp(ir(t)) 



dt 



o- 2 (iT{t)) dt 



a 2 {p(t)) dt 



(20) 



Additionally to the adaptation of e we may also adapt the jump distribution k. We suggest 
to use a Gaussian distribution whose covariance is adapted to the covariance, S, of p{t) w.r.t. 
space, according to 

& = + si. (21) 



The small multiple of the identity in eq. (21) is added to prevent the process from degener- 



ating, and (5 is a tuning parameter of the algorithm. 
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In order to initialize the algorithm, we draw a population from ( |12[ ), for a relatively 
large e = eo, using a rejection technique, i.e., we draw particles (0,x) from f(O)f(x\0) and 
accept them with probability e _p ^ x,y ^ <E °. In order to stay as close as possible to the time 



continuous process defined by (13) we should update a single particle (randomly chosen from 
the population) at a time according to ^ and update the mean fields p(p(t)), a(p(t)) and 
after each update, which can be done using recursion relations as shown in eqs. (25) 



through (|28j. 

There is always a small fraction of particles that get stuck on their way towards the target. 
Therefore, for certain applications, it might be helpful to insert a resampling step every once in 
a while. However, resampling means a loss of effective sample size. Therefore, the population 
needs to be given enough time between two resampling steps to recover. E.g., a resampling 
step can be made after a sufficiently large number of accepted updates. In a resampling step, 
we draw a new population from the old with weights 

where J is a (small) tuning parameter. After this step, the new population is a sample from 
the distribution 

£(z) = n(z)e- p ^' y)5/t . 

To first order in 5, 

pip) « p(p) - 6 -a 2 (p) . (23) 



Maintaining (19) we define e such that 

p(ir~ e ) = p{p) - vo{tti) 



Neglecting terms of order 0(vS) and 0(5 2 ), and using (18) and (23), we then find that 

e«e(l-<5). (24) 
Thus, we suggest the following algorithm: 
1. Initialization of the algorithm: 

(a) Repeat the following steps until a population of TV particles is obtained: 

i. Draw a parameter vector, 6, from the prior. 

ii. Draw an output, x, from the likelihood f(x.\0). 

iii. Accept the particle (6, x) with probability 

g -p(x,y)/e 

for a sufficiently large eo- 

(b) Set the initial mean particle distance of the equilibrium, po, equal to the average 
particle distance p: 



po = p = £p( x ^y)- 
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(c) Calculate the standard deviation of the particles from the target 



1 N 



(d) Set the initial e according to eqn. 

e = e Q 1 

V a 

where v is a small tuning parameter (by default v = 0.1). 

(e) Calculate the parameters' average and empirical covariance according to 



1 N 

-ye. 



1=1 



and 

(f) Initialize an acceptance counter 



a = 0. 

2. Iterate the following steps: 

(a) Draw a random particle, (6j,x.j), from the population. 

(b) Draw an independent proposal parameter from the normal distribution 



where k is calculated according to (21). 

(c) Draw a proposal output, x*, from the likelihood /(x|0*). 

(d) Draw a uniform random number r. 

(e) If 

r < mm {l, exp { j — 

do the following updates: 
i. Update the mean particle distance from the target 



Pnew = Paid + ^(p(x*,y) - p(xj,y)) . (25) 



ii. Update the variance of the distances 

view = oll d + JfZT[(Pold ~ Pnew) ~ JfZ\ ^^3' ~ /A**^)) ' ( 26 ) 
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iii. Update the mean of the parameters 



9 new — Gold + ly(^* ~~ ^j) ' (2?) 



iv. Update the covariance of the parameters 

£ ne „, = E oW + — - - (6g ld o id - Q T new dnew) + N _ ^ ((9*) T 0* - OjOj) ■ (28) 



v. Update the equilibrium particle distance according to eq. ( 19 ) as 



(Po)raeui • — Pnew V(T new . (29) 



vi. Update the tolerance according to eq. (13) as 



2 (Po)oid ~ (po) new <„„i 

Cnew — Gold ~ ^old 2 ' V ' 



®new 



vii. Set 6j = 0* and Xj = x*. 

viii. Increment the acceptance counter a. 

(f) (optional) If a = IN, for a sufficiently large I (e.g. I = 10), draw a new population 



of size ./V from the old one with weights (22) and update the tolerance according 



to ([24]) as 

(■new = £o/d(l — <5) • 



Remark: 

Implementing eqn. (20) does not guarantee that we stay close to equilibrium at all times. The 
risk to drift away from equilibrium naturally increases once the relaxation time r](t) (which 
can be estimated from (15)) isn't small anymore compared to the observation time. If this 
happens, it might be a good idea to either stop the algorithm or then switch to constant e. 



2.4 Comparison with other algorithms 

An important property of our algorithm is that, in the limit iV — > oo, the particles remain 
uncorrelated, if they were drawn independently at the beginning, even though the particles 
interact. This property is called propagation of chaos and is a well known consequence of the 
fact that the interaction is of mean-field type (see, e.g., [3]). 

Whether or not our algorithm is to be preferred over the Metropolis algorithm depends on 
the "degree of stochasticity" of our model and the desired precision of the result: A model is 
said to have a high degree of stochasticity if it is much cheaper to draw a sample from the 
likelihood than to evaluate the likelihood function. In the limit of infinitely many particles, as 
we have just seen, our algorithm yields independent samples from (an approximation of) the 
posterior, whereas the sample generated from the Metropolis algorithm suffers from autocor- 
relation. On the other hand, contrary to Metropolis, the whole history of the particles in our 
population has to be discarded. Which of these (dis-)advantages is more dominant depends 
on the tuning of the algorithms and the desired sample size. However, no matter how well 
our algorithm is tuned, its acceptance rate decreases dramatically at a certain tolerance level. 
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A well tuned Metropolis algorithm, on the other hand, has a constant acceptance rate of 
20 - 50%. 

In our algorithm, the sample size remains constant (except possibly at the few resampling 



steps that were suggested in Subsect. 2.3 ). This is the main advantage compared to population 
methods based on importance sampling, such as [llj or [3], where each importance sampling 
step entails a reduction in effective sample size. Efficiency is further gained replacing \{ e ~ 
p(x,y)) by exp(— p(x,y)/e). With this replacement, moves are not only accepted if they end 
up in an e-ball around the target but they are more likely accepted if they move closer to the 
target. Finally, our algorithm is of the order O(N), whereas importance sampling algorithms 
are of the order 0{N 2 ), due to the weighting step. However, both algorithms scale like 0(N) 
with the number of simulations from the likelihood, which is usually the most costly step. 



3 Toy Example 

In order to test the performance of our algorithm, we consider a case where the analytical 
equilibrium solution as a function of e is available. The prior shall be given by a univariate 
normal distribution, 



f(6) oc exp 



1 



(31) 



The output, y, is assumed to be normally distributed around 9, and we assume to have n 
independent measurements, y = (yi, . . . ,y n )- Thus, the likelihood function reads as 



/(y|6>) oc exp 



8=1 



(32) 



In order to be able to calculate the analytical solution, we set a = 2 in Q, i.e., we set 

1 n 

p( x >y) = -Vif ■ 



4 = 1 



Then, the equilibrium solution (12) is given by the normal distribution 

1 



TT e {V,X.) = 7T t {Z) (X exp 



z -/i) T S _1 (z -/x) 



with 



and 



(1+n) 


-21 T 


-21 





(33) 
(34) 

(35) 



A tedious but straightforward calculation shows that 



(l + e)/(n + l + e) 


e(l + e)n e - 1 l T 


e(l + e)n7 1 l 


Al + vO 



(36) 
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where 

A = e(n + 1 + 2e)n~ l , v = e 2 n~ x , n e = n + 1 + e(n + 2) + e 2 , 

and O denotes the matrix with zeroes on the diagonal and ones on the off-diagonals. From 
this and (34) we find that 

^( ( „; ( ; K tf e)y .) or, 

From this, we read off the posterior's mean 



and variance 



^(e) = ^, (38) 



Var^,(e) = — . (39) 



The mean particle distance is calculated from eq 

n+1 



<p(x, y)) Wt = \Y, + OH ~ Vi-i?) ■ ( 4 0) 



2 

i=2 



Using (36) and (37 (this yields 



(p(x, y)k = ^^+! + 2 ^+^E y 2 i< 1 + e ) V 1 J • ( 41 ) 

We spare the reader with the expression for the heat capacity C(e) but comment on its 
important properties. For small values of Y17=l vf > ^ ls monotonously decreasing as a function 
of e. Thus, it assumes its maximum at e = 0, namely 

C(0) = - . (42) 

For Hi l ar g er than a certain value, C(e) starts forming a peak before it decays to zero, 

for e — > oo. 

In Figs. [T] and [2] we compare the convergence of three different schemes: the CTS scheme 
explained in Subsect. 2.3, the explicit scheme for which we have proven convergence in 
Subsect. |2.2[ and an explicit scheme with a constant and small e. For simplicity, we have 
chosen ?/, = y. Furthermore, the population size was chosen to be N = 1000. For the explicit 
schedule, we've chosen, according to Theorem 2.1, 

e (t) = 6 t- 1/10 , (43) 

because a = 2 and n = 20, for an initial eo = 2.7. (Note that, for large N, we can approxi- 
mate the time continuous process used in this section by a sequence of transitions ^ upon 
multipliying ^ by a small time step At.) For the constant scheme we have chosen e(t) = 0.1. 
The decay of e(t), for these three schedules, is plotted in the left panel of Fig. [TJ The explicit 
scheme ( |43[ ) shows the slowest decay, whereas the tolerance of the constant scheme was chosen 
considerably smaller than the tolerance that is reached by the CTS after a long simulation 
time of 3000./V particle updates. For this example, resampling was found not to lead to a 
significant increase in convergence speed of the CTS schedule and was thus not applied. 
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The right panel in Fig. [T] shows the mean energy (particle distance) as a function of e. 
The solid black line corresponds to the equilibrium and shows the functional dependence of 
p(tt) from e, as given by eq. ( [41] ) . For the CTS schedule, the mean energy of p,(t) is always 



slightly above the mean energy of the target n(t), as dictated by (19). The slope of the bold 



black curve is the heat capacity ( 14 ) . A much simpler adaptive scheme pQ would be to set 

e(t) = C^pim , (44) 



where Co can be interpreted as a constant heat capacity. The schedule (44) is associated 



with a straight line in the right panel of plot [TJ If we don't want the process to converge 
for an e > 0, we have to set Co equal to the maximum of the heat capacity, which means 
that, for Ya=1 Vi no ^ too l ar g e > we have to set Cq = n/2, according to (42). But then, at 



the beginning of the process, we would have a huge discrepancy between p(p) and /o(vr), and, 
therefore, potentially pick up an additional bias. Therefore, (44) is not advised whenever 



p(n e (i)) is strongly non-linear as a function of e. The explicit scheme (43) starts off slightly 



off equilibrium but then quickly reaches equilibrium and stays close to it, as we would expect 
given our convergence proof. The constant schedule is obviously very far from equilibrum, 
at least at the beginning of the algorithm. That this is still so at the end of the simulation 
period is revealed by the left panel of Fig. [2j which shows the convergence of the expectation 
value and the standard deviation of the ^-marginal of //(i), respectively. The bold black lines 



show the corresponding values of the target distribution (12). For the explicit schedule (43) 



and the CTS, mean and standard deviation of p(t) closely follow the corresponding values of 
7r(t). For a constant e, however, still at the end of the simulation time, the expectation value 
of the marginal of p,(t) shows a bias that is much larger than the error given by a finite e and 
a finite population size. The former is given by the upper solid black line in the left panel of 
Fig. ([2} and the latt er can be estimated as 



N 



0.007. 



(45) 



This demonstrates the effect a too fast cooling can have. 

Finally, due to the relatively large output dimension n = 20, all schedules have a relatively 
large bias in the standard deviation, as revealed by the right panel of Fig. ([2]). This, however, 
is a problem inherent to all ABC schedules. 



4 Conclusions 

The interacting particle algorithm presented in this paper has several advantages compared to 
the other Approximate Bayes Computations the authors are aware of. Firstly, our algorithm 
is not based on importance sampling. Consequently, there is no loss of effective sample size 
due to re-sampling. Secondly, the Heaviside function x{ e ~ /°( x >y)) nas been replaced by the 
smooth kernel exp(— p(x, y)/e). Thus, moves are more likely to be accepted if they move 
closer to the target and not only if they move into an e-ball around the target. Thirdly, our 
algorithm not only adapts the jump distribution in parameter space but also the tolerance on 
the output space. The tuning thus restricts to the two parameters v and f3 (apart from the 
resampling parameters S and I). Since these adaptations can be interpreted as a mean- field 
interaction between the particles, the particles remain statistically independent if they were 
drawn independently at the beginning, in the limit of an infinite sample size. 
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Figure 1: The left panel shows the decay of e, as a function of time, for three different 
cooling schedules. The right panel shows the mean energy, as a function of e, for two different 
cooling schedules. The black line in the right panel shows the mean energy of the equilibrium 
solution, as a function of e. We set n = 20 and y = 0.5. 

The disadvantage of the adaptive algorithm presented here is that it is based on a heuristic 
designed to keep the population close to equilibrium at all times, but does not guarantee 
convergence. Dragging along all the particles might lead to the disadvantage that we invest 
too much computation into a small fraction of outliers that got stuck on their way towards 
the target. This can be remedied employing a resampling step every once in a while. 

The biggest disadvantage inherent to all ABC algorithms is that the tolerance leads to 
a bias that grows with the dimension of the output space n. Therefore, it is important to 
use summary statistics to reduce the output dimension or employ local approximations of the 
likelihood, for ABC to be useful for problems with large output dimensions. 
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